# EE488 Computer Architecture Lecture 4: MIPS ISA

2024 Summer

#### **Today's Lecture**

- Quick Review of Last Lecture
- Basic ISA Decisions and Design
- Announcements
- Operations
- Instruction Sequencing
- Delayed Branch
- Procedure Calling

## **Quick Review of Last Lecture**

#### **Comparing Number of Instructions**

Code sequence for (C = A + B) for four classes of instruction sets:

| Stack  | Accumulator | Register<br>(register-memory) | Register<br>(load-store) |
|--------|-------------|-------------------------------|--------------------------|
| Push A | Load A      | Load R1,A                     | Load R1,A                |
| Push B | Add B       | Add R1,B                      | Load R2,B                |
| Add    | Store C     | Store C, R1                   | Add R3,R1,R2             |
| Pop C  |             |                               | Store C,R3               |

$$ExecutionTime = \frac{1}{Performance} = Instructions \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle}$$

#### **General Purpose Registers Dominate**

- 1975-2002 all machines use general purpose registers
- Advantages of registers
  - Registers are faster than memory
  - Compiler technology has evolved to efficiently generate code for register files
    - E.g., (A\*B) - (C\*D) - (E\*F) can do multiplies in any order vs. stack
  - Registers can hold variables
    - Memory traffic is reduced, so program is sped up (since registers are faster than memory)
  - Code density improves (since register named with fewer bits than memory location)
  - Registers imply operand locality

#### **Operand Size Usage**



Support for these data sizes and types:
 8-bit, 16-bit, 32-bit integers and
 32-bit and 64-bit IEEE 754 floating point numbers

#### Typical Operations (little change since 1960)

| Category              | Operations |
|-----------------------|------------|
| Data Movement         | Load (from memory)<br>Store (to memory)<br>memory-to-memory move<br>register-to-register move<br>input (from I/O device)<br>output (to I/O device)<br>push, pop (to/from stack) |
| Arithmetic            | integer (binary + decimal) or FP: Add, Subtract, Multiply, Divide |
| Shift                 | shift left/right, rotate left/right |
| Logical               | not, and, or, set, clear |
| Control (Jump/Branch) | unconditional, conditional |
| Subroutine Linkage    | call, return |
| Interrupt             | trap, return |
| Synchronization       | test & set (atomic r-m-w) |
| String                | search, translate |
| Graphics (MMX)        | parallel subword ops (4 16bit add) |



#### **Addressing Modes**

Addressing modes specify a constant, a register, or a location in memory

```
Mode               Example              Meaning
Register           add r1, r2           r1 <- r1+r2
Immediate          add r1, #5           r1 <- r1+5
Direct             add r1, (0x200)      r1 <- r1+M[0x200]
Register indirect  add r1, (r2)         r1 <- r1+M[r2]
Displacement       add r1, 100(r2)      r1 <- r1+M[r2+100]
Indexed            add r1, (r2+r3)      r1 <- r1+M[r2+r3]
Scaled             add r1, (r2+r3*4)    r1 <- r1+M[r2+r3*4]
Memory indirect    add r1, @(r2)        r1 <- r1+M[M[r2]]
Auto-increment     add r1, (r2)+        r1 <- r1+M[r2], r2++
Auto-decrement     add r1, -(r2)        r2--, r1 <- r1+M[r2]
```

Complicated modes reduce instruction count at the cost of complex implementations
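
As a rough illustration, a few of the operand calculations listed above can be sketched in C. This is a toy model rather than any real ISA: the `Machine` struct, its sizes, and the function names are assumptions made for the example, and memory is word-addressed so that `r2++` advances by one element.

```c
#include <stdint.h>

/* Hypothetical toy machine: 8 registers, 64 words of word-addressed memory. */
enum { NREGS = 8, MEMWORDS = 64 };

typedef struct {
    uint32_t r[NREGS];
    uint32_t m[MEMWORDS];
} Machine;

/* Wrap addresses so the sketch never indexes outside the toy memory. */
static uint32_t mem_read(const Machine *mc, uint32_t addr) {
    return mc->m[addr % MEMWORDS];
}

/* Displacement: M[r[base] + disp]; disp = 0 gives register indirect. */
uint32_t operand_displacement(const Machine *mc, int base, uint32_t disp) {
    return mem_read(mc, mc->r[base] + disp);
}

/* Scaled: M[r[base] + r[index] * scale]. */
uint32_t operand_scaled(const Machine *mc, int base, int index, uint32_t scale) {
    return mem_read(mc, mc->r[base] + mc->r[index] * scale);
}

/* Memory indirect: M[M[r[base]]], which costs two memory accesses. */
uint32_t operand_memory_indirect(const Machine *mc, int base) {
    return mem_read(mc, mem_read(mc, mc->r[base]));
}

/* Auto-increment: use M[r[base]], then bump the base register. */
uint32_t operand_autoincrement(Machine *mc, int base) {
    uint32_t value = mem_read(mc, mc->r[base]);
    mc->r[base]++;
    return value;
}
```

The memory-indirect and auto-increment cases show where the implementation cost comes from: one needs a second memory access, the other a second register write, within a single instruction.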

#### **Instruction Sequencing**

- The next instruction to be executed is typically implied
  - Instructions execute sequentially
  - Instruction sequencing increments a Program Counter



- Sequencing flow is disrupted conditionally and unconditionally
  - The ability of computers to test results and conditionally execute instructions is one of the reasons computers have become so useful



#### **Instruction Set Design Metrics**

- Static Metrics
  - How many bytes does the program occupy in memory?
- Dynamic Metrics
  - How many instructions are executed?
  - How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?

How "lean" a clock is practical?

$$ExecutionTime = \frac{1}{Performance} = \underbrace{Instructions}_{\text{Instruction Count}} \times \underbrace{\frac{Cycles}{Instruction}}_{\text{CPI}} \times \underbrace{\frac{Seconds}{Cycle}}_{\text{Cycle Time}}$$
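
The three-factor product can be written as a one-line C helper. This is a minimal sketch; the function name and parameter names are assumptions for illustration only.

```c
/* Execution time = instruction count x CPI x cycle time (seconds per cycle). */
double execution_time_seconds(double instruction_count, double cpi,
                              double cycle_time_seconds) {
    return instruction_count * cpi * cycle_time_seconds;
}
```

With purely hypothetical numbers, a four-instruction load-store sequence at CPI 1 takes less time than a three-instruction register-memory sequence at CPI 2 when the cycle time is equal, which is why instruction count alone is not a sufficient metric.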

#### MIPS R2000 / R3000 Registers

- Programmable storage

| Register | Contents |
|----------|----------|
| r0       | 0        |
| r1       |          |
| ...      |          |
| r31      |          |
| PC       |          |
| lo       |          |
| hi       |          |

#### MIPS Addressing Modes/Instruction Formats

#### All instructions 32 bits wide



#### MIPS R2000 / R3000 Operation Overview

- Arithmetic logical
  - Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
  - AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
  - SLL, SRL, SRA, SLLV, SRLV, SRAV
- Memory Access
  - LB, LBU, LH, LHU, LW, LWL, LWR
  - SB, SH, SW, SWL, SWR

#### **Multiply / Divide**

- Start multiply, divide
  - MULT rs, rt
  - MULTU rs, rt
  - DIV rs, rt
  - DIVU rs, rt
- Move result from multiply, divide
  - MFHI rd
  - MFLO rd
- Move to HI or LO
  - MTHI rs
  - MTLO rs



- Why not a third field for the destination? (Hint: how many clock cycles for multiply or divide vs. add?)
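
To make the role of HI and LO concrete, here is a small C sketch of what the multiply and divide instructions leave in the two registers. The `HiLo` struct and the function names are hypothetical; the sketch assumes the divisor is non-zero and that the `INT32_MIN / -1` case is avoided, since both are undefined in C.

```c
#include <stdint.h>

/* Hypothetical container for the HI and LO special registers. */
typedef struct { uint32_t hi, lo; } HiLo;

/* mult: 64-bit signed product, upper half in HI, lower half in LO. */
HiLo mips_mult(int32_t rs, int32_t rt) {
    int64_t product = (int64_t)rs * (int64_t)rt;
    HiLo r;
    r.hi = (uint32_t)((uint64_t)product >> 32);
    r.lo = (uint32_t)(uint64_t)product;
    return r;
}

/* multu: same split, but the operands are treated as unsigned. */
HiLo mips_multu(uint32_t rs, uint32_t rt) {
    uint64_t product = (uint64_t)rs * (uint64_t)rt;
    HiLo r;
    r.hi = (uint32_t)(product >> 32);
    r.lo = (uint32_t)product;
    return r;
}

/* div: quotient in LO, remainder in HI (C truncates toward zero). */
HiLo mips_div(int32_t rs, int32_t rt) {
    HiLo r;
    r.lo = (uint32_t)(rs / rt);
    r.hi = (uint32_t)(rs % rt);
    return r;
}
```

A later `mfhi` or `mflo` would then copy one half into a general purpose register.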



## **MIPS arithmetic instructions**

| Instruction        | Example           | Meaning                              | Comments                       |
|--------------------|-------------------|--------------------------------------|--------------------------------|
| add                | add \$1,\$2,\$3   | \$1 = \$2 + \$3                      | 3 operands; exception possible |
| subtract           | sub \$1,\$2,\$3   | \$1 = \$2 - \$3                      | 3 operands; exception possible |
| add immediate      | addi \$1,\$2,100  | \$1 = \$2 + 100                      | + constant; exception possible |
| add unsigned       | addu \$1,\$2,\$3  | \$1 = \$2 + \$3                      | 3 operands; no exceptions      |
| subtract unsigned  | subu \$1,\$2,\$3  | \$1 = \$2 - \$3                      | 3 operands; no exceptions      |
| add imm. unsign.   | addiu \$1,\$2,100 | \$1 = \$2 + 100                      | + constant; no exceptions      |
| multiply           | mult \$2,\$3      | Hi, Lo = \$2 × \$3                   | 64-bit signed product          |
| multiply unsigned  | multu \$2,\$3     | Hi, Lo = \$2 × \$3                   | 64-bit unsigned product        |
| divide             | div \$2,\$3       | Lo = \$2 ÷ \$3, Hi = \$2 mod \$3     | Lo = quotient, Hi = remainder  |
| divide unsigned    | divu \$2,\$3      | Lo = \$2 ÷ \$3, Hi = \$2 mod \$3     | Unsigned quotient & remainder  |
| Move from Hi       | mfhi \$1          | \$1 = Hi                             | Used to get copy of Hi         |
| Move from Lo       | mflo \$1          | \$1 = Lo                             | Used to get copy of Lo         |

## **MIPS logical instructions**

| Instruction         | Example           | Meaning               | Comment                        |
|---------------------|-------------------|-----------------------|--------------------------------|
| and                 | and \$1,\$2,\$3   | \$1 = \$2 & \$3       | 3 reg. operands; Logical AND   |
| or                  | or \$1,\$2,\$3    | \$1 = \$2 \| \$3      | 3 reg. operands; Logical OR    |
| xor                 | xor \$1,\$2,\$3   | \$1 = \$2 ⊕ \$3       | 3 reg. operands; Logical XOR   |
| nor                 | nor \$1,\$2,\$3   | \$1 = ~(\$2 \| \$3)   | 3 reg. operands; Logical NOR   |
| and immediate       | andi \$1,\$2,10   | \$1 = \$2 & 10        | Logical AND reg, constant      |
| or immediate        | ori \$1,\$2,10    | \$1 = \$2 \| 10       | Logical OR reg, constant       |
| xor immediate       | xori \$1,\$2,10   | \$1 = \$2 ⊕ 10        | Logical XOR reg, constant      |
| shift left logical  | sll \$1,\$2,10    | \$1 = \$2 << 10       | Shift left by constant         |
| shift right logical | srl \$1,\$2,10    | \$1 = \$2 >> 10       | Shift right by constant        |
| shift right arithm. | sra \$1,\$2,10    | \$1 = \$2 >> 10       | Shift right (sign extend)      |
| shift left logical  | sllv \$1,\$2,\$3  | \$1 = \$2 << \$3      | Shift left by variable         |
| shift right logical | srlv \$1,\$2,\$3  | \$1 = \$2 >> \$3      | Shift right by variable        |
| shift right arithm. | srav \$1,\$2,\$3  | \$1 = \$2 >> \$3      | Shift right arith. by variable |
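
The table writes both right shifts as `>>`, so the difference between `srl` and `sra` is easy to miss. The following C sketch spells it out; the function names are assumptions, and the arithmetic shift is built from unsigned operations because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* srl: vacated high bits are filled with zeros. */
uint32_t mips_srl(uint32_t x, unsigned shamt) {
    return x >> (shamt & 31u);
}

/* sra: vacated high bits are filled with copies of the sign bit. */
uint32_t mips_sra(uint32_t x, unsigned shamt) {
    shamt &= 31u;
    uint32_t shifted = x >> shamt;
    if ((x & 0x80000000u) && shamt > 0) {
        /* Set the top `shamt` bits to replicate the sign. */
        shifted |= (uint32_t)~(0xFFFFFFFFu >> shamt);
    }
    return shifted;
}
```

For non-negative inputs the two functions agree; they differ only when bit 31 of the source is set.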

#### MIPS data transfer instructions

| Instruction     | Comment                |
|-----------------|------------------------|
| SW 500(R4), R3  | Store word             |
| SH 502(R2), R3  | Store half             |
| SB 41(R3), R2   | Store byte             |
| LW R1, 30(R2)   | Load word              |
| LH R1, 40(R3)   | Load halfword          |
| LHU R1, 40(R3)  | Load halfword unsigned |
| LB R1, 40(R3)   | Load byte              |
| LBU R1, 40(R3)  | Load byte unsigned     |
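
The signed and unsigned load variants differ only in how the narrow value is widened to 32 bits. A minimal C sketch, with hypothetical function names that take the byte or halfword already fetched from memory:

```c
#include <stdint.h>

/* lb: sign-extend the loaded byte to 32 bits. */
uint32_t load_lb(uint8_t byte) {
    return (byte & 0x80u) ? (0xFFFFFF00u | byte) : byte;
}

/* lbu: zero-extend the loaded byte to 32 bits. */
uint32_t load_lbu(uint8_t byte) {
    return byte;
}

/* lh: sign-extend the loaded halfword to 32 bits. */
uint32_t load_lh(uint16_t half) {
    return (half & 0x8000u) ? (0xFFFF0000u | half) : half;
}

/* lhu: zero-extend the loaded halfword to 32 bits. */
uint32_t load_lhu(uint16_t half) {
    return half;
}
```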



#### **Methods of Testing Condition**

- Condition Codes
  - Processor status bits are set as a side-effect of arithmetic instructions (possibly on Moves) or explicitly by compare or test instructions.
  - Ex:
    - add r1, r2, r3
    - bz label
- Condition Register
  - Ex:
    - cmp r1, r2, r3
    - bgt r1, label
- Compare and Branch
  - Ex:
    - bgt r1, r2, label

#### **Condition Codes**

Setting CC as a side effect can reduce the # of instructions

But it also has disadvantages:

- Not all instructions set the condition codes; which do and which do not is often confusing! e.g., shift instruction sets the carry bit
- Dependency between the instruction that sets the CC and the one that tests it: to overlap their execution, may need to separate them with an instruction that does not change the CC

| Instruction       | Cycle 1 | Cycle 2 | Cycle 3                   | Cycle 4 | Cycle 5 |
|-------------------|---------|---------|---------------------------|---------|---------|
| Sets the CC       | ifetch  | read    | compute (new CC computed) | write   |         |
| Tests the CC      |         | ifetch  | read (old CC read)        | compute | write   |

#### **Compare and Branch**

- Compare and Branch
  - BEQ rs, rt, offset: if R[rs] == R[rt] then PC-relative branch
  - BNE rs, rt, offset: if R[rs] <> R[rt] then PC-relative branch
- Compare to zero and Branch
  - BLEZ rs, offset: if R[rs] <= 0 then PC-relative branch
  - BGTZ rs, offset: if R[rs] > 0
  - BLTZ rs, offset: if R[rs] < 0
  - BGEZ rs, offset: if R[rs] >= 0
  - BLTZAL rs, offset: if R[rs] < 0 then branch and link (into R31)
  - BGEZAL rs, offset: if R[rs] >= 0 then branch and link (into R31)

- Remaining set of compare and branch take two instructions
- Almost all comparisons are against zero!

### MIPS jump, branch, compare instructions

| Instruction         | Example           | Meaning                            | Comment                             |
|---------------------|-------------------|------------------------------------|-------------------------------------|
| branch on equal     | beq \$1,\$2,100   | if (\$1 == \$2) go to PC+4+100     | Equal test; PC relative branch      |
| branch on not eq.   | bne \$1,\$2,100   | if (\$1 != \$2) go to PC+4+100     | Not equal test; PC relative         |
| set on less than    | slt \$1,\$2,\$3   | if (\$2 < \$3) \$1=1; else \$1=0   | Compare less than; 2's comp.        |
| set less than imm.  | slti \$1,\$2,100  | if (\$2 < 100) \$1=1; else \$1=0   | Compare < constant; 2's comp.       |
| set less than uns.  | sltu \$1,\$2,\$3  | if (\$2 < \$3) \$1=1; else \$1=0   | Compare less than; natural numbers  |
| set l. t. imm. uns. | sltiu \$1,\$2,100 | if (\$2 < 100) \$1=1; else \$1=0   | Compare < constant; natural numbers |
| jump                | j 10000           | go to 10000                        | Jump to target address              |
| jump register       | jr \$31           | go to \$31                         | For switch, procedure return        |
| jump and link       | jal 10000         | \$31 = PC + 4; go to 10000         | For procedure call                  |
#### Signed vs. Unsigned Comparison

R1= 0...00 0000 0000 0000 0001<sub>2</sub>
R2= 0...00 0000 0000 0000 0010<sub>2</sub>
R3= 1...11 1111 1111 1111<sub>2</sub>

Value?
2's complement?
Unsigned?

After executing these instructions:

```
slt  r4,r2,r1 ; if (r2 < r1) r4=1; else r4=0
slt  r5,r3,r1 ; if (r3 < r1) r5=1; else r5=0
sltu r6,r2,r1 ; if (r2 < r1) r6=1; else r6=0
sltu r7,r3,r1 ; if (r3 < r1) r7=1; else r7=0
```

What are values of registers r4 - r7? Why?

$$r4 = ; r5 = ; r6 = ; r7 = ;$$
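
One way to check an answer is to model the two comparisons in C. The sketch below uses hypothetical function names; the signed compare flips the sign bit of both operands and then compares them as unsigned, which orders two's complement values correctly without relying on implementation-defined casts.

```c
#include <stdint.h>

/* sltu: compare the bit patterns as natural (unsigned) numbers. */
uint32_t mips_sltu(uint32_t rs, uint32_t rt) {
    return rs < rt ? 1u : 0u;
}

/* slt: compare the bit patterns as two's complement numbers.
   Flipping bit 31 maps the signed order onto the unsigned order. */
uint32_t mips_slt(uint32_t rs, uint32_t rt) {
    return (rs ^ 0x80000000u) < (rt ^ 0x80000000u) ? 1u : 0u;
}
```

With R3 holding all ones, `mips_slt` sees it as -1 and `mips_sltu` sees it as the largest 32-bit value, so the same bit pattern compares differently against R1.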

#### **Calls: Why Are Stacks So Great?**

Stacking of Subroutine Calls & Returns and Environments:



Some machines provide a memory stack as part of the architecture (e.g., VAX)

Sometimes stacks are implemented via software convention (e.g., MIPS)

#### **Memory Stacks**

Useful for stacked environments/subroutine call & return even if operand stack not part of architecture

Stacks that Grow Up vs. Stacks that Grow Down:



How is empty stack represented?

- Little --> Big / Last Full
  - POP: Read from Mem(SP), then Decrement SP
  - PUSH: Increment SP, then Write to Mem(SP)
- Little --> Big / Next Empty
  - POP: Decrement SP, then Read from Mem(SP)
  - PUSH: Write to Mem(SP), then Increment SP
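
The two conventions can be sketched in C for a stack that grows from low to high addresses. The `Stack` struct, its size, and the function names are assumptions for illustration, and bounds checking is omitted to keep the ordering of the two steps visible.

```c
#include <stdint.h>

enum { STACK_WORDS = 16 };

/* Hypothetical stack: sp is an index into mem. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int sp;
} Stack;

/* Last full: sp points at the most recently pushed word.
   An empty stack is sp == -1 (one below the first slot). */
void push_last_full(Stack *s, uint32_t v) { s->sp++; s->mem[s->sp] = v; }
uint32_t pop_last_full(Stack *s) {
    uint32_t v = s->mem[s->sp];
    s->sp--;
    return v;
}

/* Next empty: sp points at the first free word.
   An empty stack is sp == 0 (the first slot itself). */
void push_next_empty(Stack *s, uint32_t v) { s->mem[s->sp] = v; s->sp++; }
uint32_t pop_next_empty(Stack *s) {
    s->sp--;
    return s->mem[s->sp];
}
```

The comments show one possible answer to how an empty stack is represented under each convention; a real ABI would fix this choice once and for all.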

#### **Call-Return Linkage: Stack Frames**



- Many variations on stacks possible (up/down, last pushed / next empty)
- Block structured languages contain link to lexically enclosing frame
- Compilers normally keep scalar variables in registers, not memory!




#### **MIPS: Software conventions for Registers**

| Number | Name  | Usage                                        |
|--------|-------|----------------------------------------------|
| 0      | zero  | constant 0                                   |
| 1      | at    | reserved for assembler                       |
| 2-3    | v0-v1 | expression evaluation & function results     |
| 4-7    | a0-a3 | arguments                                    |
| 8-15   | t0-t7 | temporary: caller saves (callee can clobber) |
| 16-23  | s0-s7 | callee saves (caller can clobber)            |
| 24-25  | t8-t9 | temporary (cont'd)                           |
| 26-27  | k0-k1 | reserved for OS kernel                       |
| 28     | gp    | Pointer to global area                       |
| 29     | sp    | Stack pointer                                |
| 30     | fp    | frame pointer                                |
| 31     | ra    | Return Address (HW)                          |

Plus a 3-deep stack of mode bits.

#### **Example in C: swap**

```
void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
```

- Assume swap is called as a procedure
- Assume temp is register \$15; arguments in \$a1, \$a2; \$16 is scratch reg
- Write MIPS code

#### swap: MIPS

```
swap:
    addiu $sp, $sp, -4    ; create space on stack
    sw    $16, 4($sp)     ; callee saved register put onto stack
    sll   $t2, $a2, 2     ; multiply k by 4
    addu  $t2, $a1, $t2   ; address of v[k]
    lw    $15, 0($t2)     ; load v[k]
    lw    $16, 4($t2)     ; load v[k+1]
    sw    $16, 0($t2)     ; store v[k+1] into v[k]
    sw    $15, 4($t2)     ; store old value of v[k] into v[k+1]
    lw    $16, 4($sp)     ; callee saved register restored from stack
    addiu $sp, $sp, 4     ; restore top of stack
    jr    $31             ; return to place that called swap
```

#### **Delayed Branches**

```
li r3, #7
sub r4, r4, 1
bz r4, LL
addi r5, r3, 1
subi r6, r6, 2
LL: slt r1, r3, r5
```

- In the "Raw" MIPS, the instruction after the branch is executed even when the branch is taken
  - This is hidden by the assembler for the MIPS "virtual machine"
  - Allows the compiler to better utilize the instruction pipeline (???)

#### **Branch & Pipelines**



By the end of Branch instruction, the CPU knows whether or not the branch will take place.

However, it will have fetched the next instruction by then, regardless of whether or not a branch will be taken.

Why not execute it?

#### Filling Delayed Branches



- Compiler can fill a single delay slot with a useful instruction about 50% of the time.
  - try to move down from above jump
  - move up from target, if safe

add r3, r1, r2<br>
sub r4, r4, 1<br>
bz r4, LL<br>
NOP

LL: add rd, ...

Is this violating the ISA abstraction?

#### **Standard and Delayed Interpretation**

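| Instruction        | Standard interpretation (PC)                                  | Delayed interpretation (PC, nPC)                                             |
|--------------------|---------------------------------------------------------------|------------------------------------------------------------------------------|
| add rd, rs, rt     | R[rd] <- R[rs] + R[rt]; PC <- PC + 4                          | R[rd] <- R[rs] + R[rt]; PC <- nPC; nPC <- nPC + 4                            |
| beq rs, rt, offset | if R[rs] == R[rt] then PC <- PC + SX(offset) else PC <- PC + 4 | PC <- nPC; if R[rs] == R[rt] then nPC <- nPC + SX(offset) else nPC <- nPC + 4 |
| sub rd, rs, rt     | . . .                                                         | . . .                                                                        |
| L1: target         |                                                               |                                                                              |

In the delayed interpretation, PC always receives the old value of nPC, so the instruction after the branch executes before control reaches the target.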
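
The delayed interpretation above can be sketched as a pair of state updates in C. The `PCState` struct and function names are hypothetical, and `offset` stands for the already sign-extended byte offset, SX(offset).

```c
#include <stdint.h>

/* Hypothetical fetch state: pc is the executing instruction, npc the next one. */
typedef struct { uint32_t pc, npc; } PCState;

/* Non-branch instruction: PC <- nPC; nPC <- nPC + 4. */
PCState step_sequential(PCState s) {
    PCState next;
    next.pc = s.npc;
    next.npc = s.npc + 4u;
    return next;
}

/* Delayed branch: PC still advances to the old nPC (the delay slot);
   only nPC is redirected when the branch is taken. */
PCState step_branch(PCState s, int taken, int32_t offset) {
    PCState next;
    next.pc = s.npc;
    next.npc = taken ? s.npc + (uint32_t)offset : s.npc + 4u;
    return next;
}
```

Starting from a branch at address 100 with a taken offset of 40, the next instruction executed is the one at 104 (the delay slot), and only then the target at 144.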

**Delayed Loads?** 

#### **Delayed Branches (cont.)**

#### **Execution History**



- Branches are the bane (or pain!) of pipelined machines
- Delayed branches complicate the compiler slightly, but make pipelining easier to implement and more effective
- Good strategy to move some complexity to compile time

#### **Details of the MIPS instruction set**

- Register zero always has the value zero (even if you try to write it)
- Branch-and-link and jump-and-link instructions put the return address PC+4 into the link register (R31)
- All instructions change all 32 bits of the destination register (including lui, lb, lh) and all read all 32 bits of sources (add, sub, and, or, ...)
- Immediate arithmetic and logical instructions are extended as follows:
  - logical immediates are zero extended to 32 bits
  - arithmetic immediates are sign extended to 32 bits
- The data loaded by the instructions lb and lh are extended as follows:
  - lbu, lhu are zero extended
  - lb, lh are sign extended
- Overflow can occur in these arithmetic and logical instructions:
  - add, sub, addi
  - it <u>cannot</u> occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu
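
The overflow and immediate-extension rules can be sketched in C as follows. The function names are assumptions; `add_overflows` reports the condition under which `add`, `sub` (on the negated operand), and `addi` would raise an exception, while `mips_addiu` shows that the "unsigned" variant still sign-extends its immediate and simply wraps.

```c
#include <stdint.h>

/* Signed overflow on a 32-bit add: both operands have the same sign
   and the sum has the opposite sign. Returns 1 on overflow, else 0. */
int add_overflows(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;
    return (int)(((uint32_t)~(a ^ b) & (a ^ sum)) >> 31);
}

/* Arithmetic immediates are sign extended from 16 to 32 bits. */
uint32_t sign_extend16(uint16_t imm) {
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : imm;
}

/* addiu: sign-extended immediate, result wraps and never traps. */
uint32_t mips_addiu(uint32_t rs, uint16_t imm) {
    return rs + sign_extend16(imm);
}
```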

#### **Other ISAs**

- Intel 8086/88 => 80286 => 80386 => 80486 => Pentium => P6
  - 8086: few transistors available to implement a 16-bit microprocessor
  - tried to be somewhat compatible with 8-bit microprocessor 8080
  - successors added features which were missing from 8086 over next 15 years
  - product of several different Intel engineers over 10 to 15 years
  - Announced 1978
- VAX: simple compilers & small code size =>
  - efficient instruction encoding
  - powerful addressing modes
  - powerful instructions
  - few registers
  - product of a single talented architect
  - Announced 1977

#### **Machine Examples: Address & Registers**

| Machine    | Memory                       | Registers |
|------------|------------------------------|-----------|
| Intel 8086 | 2<sup>20</sup> x 8 bit bytes | AX, BX, CX, DX (acc, index, count, quot)<br>SP, BP, SI, DI (stack, string)<br>CS, SS, DS (code, stack, data segment)<br>IP, Flags |
| VAX 11     | 2<sup>32</sup> x 8 bit bytes | 16 x 32 bit GPRs<br>r15: program counter<br>r14: stack pointer<br>r13: frame pointer<br>r12: argument ptr |
| MC 68000   | 2<sup>24</sup> x 8 bit bytes | 8 x 32 bit GPRs<br>7 x 32 bit addr reg<br>1 x 32 bit SP<br>1 x 32 bit PC |
| MIPS       | 2<sup>32</sup> x 8 bit bytes | 32 x 32 bit GPRs<br>32 x 32 bit FPRs<br>HI, LO, PC |

#### **Details of the MIPS instruction set**

- Register zero <u>always</u> has the value <u>zero</u> (even if you try to write it)
- Branch/jump and link put the return addr. PC+4 into the link register (R31)
- All instructions change <u>all 32 bits</u> of the destination register (including lui, lb, lh) and all read all 32 bits of sources (add, sub, and, or, ...)
- Immediate arithmetic and logical instructions are extended as follows:
  - logical immediate ops are zero extended to 32 bits
  - arithmetic immediate ops are sign extended to 32 bits (including addiu)
- The data loaded by the instructions lb and lh are extended as follows:
  - lbu, lhu are zero extended
  - lb, lh are sign extended
- Overflow can occur in these arithmetic and logical instructions:
  - · add, sub, addi
  - it <u>cannot</u> occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu

#### **Summary**

- Use general purpose registers with a load-store architecture: YES
- Provide at least 16 general purpose registers plus separate floating-point registers: 31 GPR & 32 FPR
- Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register deferred: YES, 16 bits for immediate, displacement (disp=0 => register deferred)
- All addressing modes apply to all data transfer instructions: YES
- Use fixed instruction encoding if interested in performance and use variable instruction encoding if interested in code size : Fixed
- Support these data sizes and types: 8-bit, 16-bit, 32-bit integers and 32-bit and 64-bit IEEE 754 floating point numbers: YES
- Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8 bits long), jump, call, and return: YES, 16b
- Aim for a minimalist instruction set: YES

#### **Summary: Salient features of MIPS R3000**

- 32-bit fixed format inst (3 formats)
- 32 32-bit GPR (R0 contains zero) and 32 FP registers (and HI LO)
  - partitioned by software convention
- 3-address, reg-reg arithmetic instr.
- Single address mode for load/store: base+displacement
  - no indirection
  - 16-bit immediate plus LUI
- Simple branch conditions
  - compare against zero or two registers for =
  - no condition codes
- Delayed branch
  - execute instruction after the branch (or jump) even if the branch is taken (Compiler can fill a delayed branch with useful work about 50% of the time)
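
The "16-bit immediate plus LUI" point can be illustrated with a short C sketch of how a full 32-bit constant is assembled from two 16-bit halves. The function names are hypothetical; the sketch uses `ori` for the low half because its immediate is zero extended, whereas a sign-extended add would disturb the upper half whenever bit 15 of the low half is set.

```c
#include <stdint.h>

/* lui: place the 16-bit immediate in the upper half, zero the lower half. */
uint32_t mips_lui(uint16_t imm) {
    return (uint32_t)imm << 16;
}

/* ori: OR with the zero-extended 16-bit immediate. */
uint32_t mips_ori(uint32_t rs, uint16_t imm) {
    return rs | (uint32_t)imm;
}

/* Two-instruction sequence: lui for the upper half, ori for the lower half. */
uint32_t load_constant32(uint32_t value) {
    uint32_t reg = mips_lui((uint16_t)(value >> 16));
    return mips_ori(reg, (uint16_t)(value & 0xFFFFu));
}
```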